PILIPINAS IN A NUTSHELL (PINUT 2023)
"HIV AIDS is a disease with stigma. And we have learned with experience, not just with HIV AIDS but with other diseases countries for many reasons are sometimes hesitant to admit they have a problem."
– Margaret Chan
I. INTRODUCTION
It's time to look at the facts. 12026 entries. All women. What you are about to observe is the state of HIV in the Philippines in 2022.
This notebook will be divided into three parts. Data Cleaning. Data Exploration. Data Visualization. We will provide a walkthrough on how we were able to filter, analyze, and report on the dataset that we have obtained for this project.
II. DATA CLEANING
IMPORT PACKAGES
Our main tool will be Pandas, a data analysis library for the Python programming language (we use Python 3.11). Before anything else, let us import the packages we need.
# Pandas Library (Data Analysis)
import pandas as pd
# Numpy and Scipy Libraries (Numerical and Scientific Computing)
import numpy as np
import scipy as sp
# Sklearn Library (Machine Learning)
import sklearn; from sklearn import *
# Plotly Library (Data Visualization)
import plotly.io as pio
import plotly.subplots as ps
import plotly.graph_objects as go
import plotly.express as px
# For Displaying Dataframes in Jupyter Notebooks
from IPython.display import display
Make sure to install the packages first if you haven't already.
Here are the commands for each library as of 2024 (type these in your Terminal):
- Pandas: pip install pandas
- Numpy: pip install numpy
- Scipy: pip install scipy
- Sklearn: pip install scikit-learn
- Plotly: pip install plotly
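If you are unsure whether the installs succeeded, a quick check like the one below can help. This is only a minimal sketch: the helper name installed_versions is hypothetical, and it relies on the standard-library importlib.metadata module, which looks packages up by their pip name (so scikit-learn, not sklearn).

```python
from importlib.metadata import PackageNotFoundError, version

# pip distribution names (note: "scikit-learn", not "sklearn")
required_packages = ["pandas", "numpy", "scipy", "scikit-learn", "plotly"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print one line per package so missing installs are easy to spot
for name, ver in installed_versions(required_packages).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can then be installed with the matching pip command above.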
READ THE DATASET FILES
It is now time to get our dataset. The .csv files themselves can be found in our GitHub Repository Page. We can use the read_csv() function to load each file into a dataframe.
# Store the File Paths into Variables
dataset_Codes_FilePath = "dataset_Codes.csv"
dataset_Proper_FilePath = "dataset_Proper.csv"
# Read the Files from said Variables
dataset_Codes_DataFrame = pd.read_csv(dataset_Codes_FilePath)
dataset_Proper_DataFrame = pd.read_csv(dataset_Proper_FilePath)
# Create a Copy of the Dataframes
dataset_Codes_Copy = dataset_Codes_DataFrame.copy()
dataset_Proper_Copy = dataset_Proper_DataFrame.copy()
# Optional Display Settings
pd.set_option("display.max_columns", 5)
pd.set_option("display.max_rows", None)
pd.set_option("future.no_silent_downcasting", True)
pio.renderers.default = "notebook"
WHY TWO FILES?
Here's why. Let's take a peek into our two dataframes so far.
# These are the first five entries of the "Codes" Dataframe
dataset_Codes_Copy.head()
| | CASEID | Case Identification |
|---|---|---|
| 0 | V000 | Country code and phase |
| 1 | V001 | Cluster number |
| 2 | V002 | Household number |
| 3 | V003 | Respondent's line number |
| 4 | V004 | Ultimate area unit |
# These are the first five entries of the "Proper" Dataframe
dataset_Proper_Copy.head()
| | V000 | V001 | ... | V867D | V867E |
|---|---|---|---|---|---|
| 0 | PH8 | 1 | ... | | |
| 1 | PH8 | 1 | ... | | |
| 2 | PH8 | 1 | ... | | |
| 3 | PH8 | 1 | ... | | |
| 4 | PH8 | 1 | ... | | |
5 rows × 232 columns
Notice anything? Pay close attention to the entries of the CASEID Column from the Codes Dataframe.
Now, pay close attention to the column names of the Proper Dataframe.
Indeed. The Codes Dataframe is so named because it maps each code to the actual survey question asked in the Proper Dataframe. However, if we started working on the Proper Dataframe immediately, we would have a hard time seeing which question each column name corresponds to. We would have to tediously check both files, back and forth.
Let's make our lives easier. Let's rename all the columns of the Proper Dataframe. Sure, the column names will be long (for now), but at least we would have an idea of what question was asked without having to check on both files at the same time.
To do this, let us create a Dictionary with the list of Codes and their corresponding Survey Questions as the Key-Value pairs. Afterwards, let us use the rename() function in order to modify the column names of the Proper Dataframe.
# Create a Dictionary of the Codes mapped to their respective Column Names
dataset_Codes_Dictionary = dict(zip(dataset_Codes_Copy["CASEID"], dataset_Codes_Copy["Case Identification"]))
# Rename the Columns of the "Proper" Dataframe
dataset_Proper_Copy.rename(columns = dataset_Codes_Dictionary, inplace = True)
# Let us view our changes
dataset_Proper_Copy.head()
| | Country code and phase | Cluster number | ... | Things happened because HIV positive status: healthcare workers talked badly | Things happened because HIV positive status: healthcare workers verbally abused |
|---|---|---|---|---|---|
| 0 | PH8 | 1 | ... | | |
| 1 | PH8 | 1 | ... | | |
| 2 | PH8 | 1 | ... | | |
| 3 | PH8 | 1 | ... | | |
| 4 | PH8 | 1 | ... | | |
5 rows × 232 columns
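To see the renaming idea in isolation, here is a minimal sketch on made-up data. The toy frames and the extra column V999 are hypothetical and only stand in for the real files. It also shows one thing worth checking: rename() silently leaves any column without a matching code untouched.

```python
import pandas as pd

# Hypothetical miniature versions of the two files (not the real survey data)
toy_codes = pd.DataFrame({
    "CASEID": ["V000", "V001", "V002"],
    "Case Identification": ["Country code and phase", "Cluster number", "Household number"],
})
toy_proper = pd.DataFrame({"V000": ["PH8", "PH8"], "V001": [1, 1], "V999": [0, 1]})

# Build the code -> question lookup as key-value pairs
toy_lookup = dict(zip(toy_codes["CASEID"], toy_codes["Case Identification"]))

# rename() only touches columns that appear in the lookup
toy_renamed = toy_proper.rename(columns=toy_lookup)
print(list(toy_renamed.columns))

# Columns whose code has no description in the lookup
unmapped = [col for col in toy_proper.columns if col not in toy_lookup]
print(unmapped)
```

Running the same unmapped check on the real dataframes would confirm whether every column code has a description before moving on.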
COMPLETE OR DELETE MISSING VALUES
There are several missing values in the dataset. Don't believe me? Look at this then.
WARNING: For demonstration purposes only, I will be printing ALL of the original columns from the Proper Dataframe. Don't worry, we will be trimming this significantly as we go on.
For the meantime, kindly brace for a lengthy scroll ahead.
def dataFrame_Print_Missing_Info(dataframe: pd.DataFrame):
    # Create a new Dataframe to hold our Missing Data Info
    nullCount_dataFrame = pd.DataFrame(index = dataframe.columns)
    # Count the Number and Percent of Cells containing only Empty Spaces and Null Values for each Column
    nullCount_dataFrame.loc[:, "DataType"] = dataframe.dtypes
    nullCount_dataFrame.loc[:, "TotalCount"] = dataframe.shape[0]
    nullCount_dataFrame.loc[:, "NullCount"] = dataframe.apply(lambda x: x.eq(" ").sum() if x.dtype == "object" else x.isnull().sum())
    nullCount_dataFrame.loc[:, "NullPercent"] = (((nullCount_dataFrame["NullCount"] / nullCount_dataFrame["TotalCount"]) * 100)).round(2)
    # Display the Column Datatypes and the Info on Missing Data
    display(nullCount_dataFrame)
dataFrame_Print_Missing_Info(dataset_Proper_Copy)
| | DataType | TotalCount | NullCount | NullPercent |
|---|---|---|---|---|
| Country code and phase | object | 27821 | 0 | 0.00 |
| Cluster number | int64 | 27821 | 0 | 0.00 |
| Household number | int64 | 27821 | 0 | 0.00 |
| Respondent's line number | int64 | 27821 | 0 | 0.00 |
| Ultimate area unit | int64 | 27821 | 0 | 0.00 |
| Women's individual sample weight (6 decimals) | int64 | 27821 | 0 | 0.00 |
| Month of interview | int64 | 27821 | 0 | 0.00 |
| Year of interview | int64 | 27821 | 0 | 0.00 |
| Date of interview (CMC) | int64 | 27821 | 0 | 0.00 |
| Date of interview Century Day Code (CDC) | int64 | 27821 | 0 | 0.00 |
| Respondent's month of birth | int64 | 27821 | 0 | 0.00 |
| Respondent's year of birth | int64 | 27821 | 0 | 0.00 |
| Date of birth (CMC) | int64 | 27821 | 0 | 0.00 |
| Respondent's current age | int64 | 27821 | 0 | 0.00 |
| Age in 5-year groups | int64 | 27821 | 0 | 0.00 |
| Completeness of age information | int64 | 27821 | 0 | 0.00 |
| Result of individual interview | int64 | 27821 | 0 | 0.00 |
| Day of interview | int64 | 27821 | 0 | 0.00 |
| CMC start of calendar | int64 | 27821 | 0 | 0.00 |
| Row of month of interview | int64 | 27821 | 0 | 0.00 |
| Length of calendar | int64 | 27821 | 0 | 0.00 |
| Number of calendar columns | int64 | 27821 | 0 | 0.00 |
| Ever-married sample | int64 | 27821 | 0 | 0.00 |
| Primary sampling unit | int64 | 27821 | 0 | 0.00 |
| Sample strata for sampling errors | int64 | 27821 | 0 | 0.00 |
| Stratification used in sample design | int64 | 27821 | 0 | 0.00 |
| Region | int64 | 27821 | 0 | 0.00 |
| Type of place of residence | int64 | 27821 | 0 | 0.00 |
| NA - De facto place of residence | object | 27821 | 27821 | 100.00 |
| Number of visits | int64 | 27821 | 0 | 0.00 |
| Interviewer identification | int64 | 27821 | 0 | 0.00 |
| NA - Keyer identification | object | 27821 | 27821 | 100.00 |
| Field supervisor | int64 | 27821 | 0 | 0.00 |
| NA - Field editor | object | 27821 | 27821 | 100.00 |
| NA - Office editor | object | 27821 | 27821 | 100.00 |
| Line number of husband | object | 27821 | 13373 | 48.07 |
| Cluster altitude in meters | int64 | 27821 | 0 | 0.00 |
| Household selected for hemoglobin | int64 | 27821 | 0 | 0.00 |
| Selected for Domestic Violence module | int64 | 27821 | 0 | 0.00 |
| Language of questionnaire | int64 | 27821 | 0 | 0.00 |
| Language of interview | int64 | 27821 | 0 | 0.00 |
| Native language of respondent | int64 | 27821 | 0 | 0.00 |
| Translator used | int64 | 27821 | 0 | 0.00 |
| Team number | int64 | 27821 | 0 | 0.00 |
| Team supervisor | int64 | 27821 | 0 | 0.00 |
| Region | int64 | 27821 | 0 | 0.00 |
| Type of place of residence | int64 | 27821 | 0 | 0.00 |
| NA - Childhood place of residence | object | 27821 | 27821 | 100.00 |
| Years lived in place of residence | int64 | 27821 | 0 | 0.00 |
| Type of place of previous residence | object | 27821 | 16458 | 59.16 |
| Region of previous residence | object | 27821 | 16458 | 59.16 |
| Highest educational level | int64 | 27821 | 0 | 0.00 |
| Highest year of education | object | 27821 | 282 | 1.01 |
| Ever heard of a Sexually Transmitted Infection (STI) | int64 | 27821 | 0 | 0.00 |
| Ever heard of AIDS | int64 | 27821 | 0 | 0.00 |
| NA - Reduce risk of getting HIV: do not have sex at all | object | 27821 | 27821 | 100.00 |
| Reduce risk of getting HIV: always use condoms during sex | object | 27821 | 18636 | 66.99 |
| Reduce risk of getting HIV: have 1 sex partner only, who has no other partners | object | 27821 | 18636 | 66.99 |
| Can get HIV from mosquito bites | object | 27821 | 18636 | 66.99 |
| Can get HIV by sharing food with person who has AIDS | object | 27821 | 18636 | 66.99 |
| A healthy looking person can have HIV | object | 27821 | 18636 | 66.99 |
| Condom used during last sex with most recent partner | object | 27821 | 11905 | 42.79 |
| Condom used during last sex with 2nd to most recent partner | object | 27821 | 27740 | 99.71 |
| Condom used during last sex with 3rd to most recent partner | object | 27821 | 27807 | 99.95 |
| Source of condoms used for last sex | object | 27821 | 27341 | 98.27 |
| Brand of condom used for last sex | object | 27821 | 27341 | 98.27 |
| Had any STI in last 12 months | int64 | 27821 | 0 | 0.00 |
| Had genital sore/ulcer in last 12 months | int64 | 27821 | 0 | 0.00 |
| Had genital discharge in last 12 months | int64 | 27821 | 0 | 0.00 |
| NA - Had CS STI in last 12 months | object | 27821 | 27821 | 100.00 |
| NA - Had CS STI in last 12 months | object | 27821 | 27821 | 100.00 |
| NA - Had CS STI in last 12 months | object | 27821 | 27821 | 100.00 |
| NA - Had CS STI in last 12 months | object | 27821 | 27821 | 100.00 |
| Number of sex partners, excluding spouse, in last 12 months | int64 | 27821 | 0 | 0.00 |
| Number of sex partners, including spouse, in last 12 months | int64 | 27821 | 0 | 0.00 |
| Relationship with most recent sex partner | object | 27821 | 11905 | 42.79 |
| Relationship with 2nd to most recent sex partner | object | 27821 | 27740 | 99.71 |
| Relationship with 3rd to most recent sex partner | object | 27821 | 27807 | 99.95 |
| NA - Length of time knows last partner | object | 27821 | 27821 | 100.00 |
| NA - Length of time knows other partner (1) | object | 27821 | 27821 | 100.00 |
| NA - Length of time knows other partner (2) | object | 27821 | 27821 | 100.00 |
| NA - Sought advice/treatment for last STI infection | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: government hospital | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS public | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: private hospital/clinic/doctor | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS private | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS other | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS other | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS other | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: CS other | object | 27821 | 27821 | 100.00 |
| NA - Sought STI advice/treatment from: other | object | 27821 | 27821 | 100.00 |
| NA - HIV transmitted during pregnancy | object | 27821 | 27821 | 100.00 |
| NA - HIV transmitted during delivery | object | 27821 | 27821 | 100.00 |
| NA - HIV transmitted by breastfeeding | object | 27821 | 27821 | 100.00 |
| NA - Knows someone who has, or is suspected of having, HIV | object | 27821 | 27821 | 100.00 |
| NA - Would want HIV infection in family to remain secret | object | 27821 | 27821 | 100.00 |
| NA - Would be ashamed if someone in the family had HIV | object | 27821 | 27821 | 100.00 |
| NA - Willing to care for relative with AIDS | object | 27821 | 27821 | 100.00 |
| NA - A female teacher infected with HIV, but is not sick, should be allowed to continue teaching | object | 27821 | 27821 | 100.00 |
| NA - Children should be taught about condoms to avoid AIDS | object | 27821 | 27821 | 100.00 |
| Ever been tested for HIV | int64 | 27821 | 0 | 0.00 |
| NA - Know a place to get HIV test | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: government hospital | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS public | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: private hospital/clinic/doctor | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS private | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS other | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS other | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: CS other | object | 27821 | 27821 | 100.00 |
| NA - Place for HIV test: other | object | 27821 | 27821 | 100.00 |
| Heard about other STIs | int64 | 27821 | 0 | 0.00 |
| NA - Last 12 months had sex in return for gifts, cash or other | object | 27821 | 27821 | 100.00 |
| Used a method at last sexual intercourse | object | 27821 | 12805 | 46.03 |
| Method used last sexual intercourse: female sterilization | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: male sterilization | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: IUD | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: injectables | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: implants | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: pill | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: condom | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: female condom | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: emergency contraception | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: standard days method | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: lactational amenorrhea | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: rhythm | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: withdrawal | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: patch | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: mucus/billings/ovulation | object | 27821 | 19398 | 69.72 |
| NA - Method used last sexual intercourse: CS | object | 27821 | 27821 | 100.00 |
| Method used last sexual intercourse: other modern | object | 27821 | 19398 | 69.72 |
| Method used last sexual intercourse: other traditional | object | 27821 | 19398 | 69.72 |
| NA - Condom used at first sex | object | 27821 | 27821 | 100.00 |
| NA - Most recent sex partner younger, the same age or older | object | 27821 | 27821 | 100.00 |
| NA - 2nd to most recent sex partner younger, the same age or older | object | 27821 | 27821 | 100.00 |
| NA - 3rd to most recent sex partner younger, the same age or older | object | 27821 | 27821 | 100.00 |
| Wife justified asking husband to use condom if he has STI | int64 | 27821 | 0 | 0.00 |
| NA - Can get HIV by witchcraft or supernatural means | object | 27821 | 27821 | 100.00 |
| Drugs to avoid HIV transmission to baby during pregnancy | object | 27821 | 2537 | 9.12 |
| Would buy vegetables from vendor with HIV | object | 27821 | 2537 | 9.12 |
| NA - Last time tested for HIV | object | 27821 | 27821 | 100.00 |
| Months ago most recent HIV test | object | 27821 | 25577 | 91.93 |
| NA - Last HIV test: on your own, offered or required | object | 27821 | 27821 | 100.00 |
| Received result from last HIV test | object | 27821 | 25577 | 91.93 |
| Place where last HIV test was taken | object | 27821 | 25577 | 91.93 |
| NA - Age of first sex partner | object | 27821 | 27821 | 100.00 |
| NA - First sex partner younger, same age or older | object | 27821 | 27821 | 100.00 |
| NA - Time since last sex with 2nd to most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Time since last sex with 3rd to most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Used condom every time had sex with most recent partner in last 12 months | object | 27821 | 27821 | 100.00 |
| NA - Used condom every time had sex with 2nd to most recent partner in last 12 months | object | 27821 | 27821 | 100.00 |
| NA - Used condom every time had sex with 3rd to most recent partner in last 12 months | object | 27821 | 27821 | 100.00 |
| NA - Age of most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Age of 2nd to most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Age of 3rd to most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Alcohol consumption at last sex with most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Alcohol consumption at last sex with 2nd to most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Alcohol consumption at last sex with 3rd to most recent partner | object | 27821 | 27821 | 100.00 |
| Total lifetime number of sex partners | object | 27821 | 9425 | 33.88 |
| Heard of ARVs to treat HIV | object | 27821 | 2537 | 9.12 |
| NA - During antenatal visit talked about: HIV transmitted mother to child | object | 27821 | 27821 | 100.00 |
| NA - During antenatal visit talked about: things to do to prevent getting HIV | object | 27821 | 27821 | 100.00 |
| NA - During antenatal visit talked about: getting tested for HIV | object | 27821 | 27821 | 100.00 |
| NA - Offered HIV test as part of antenatal visit | object | 27821 | 27821 | 100.00 |
| NA - Offered HIV test between the time went for delivery and before baby was born | object | 27821 | 27821 | 100.00 |
| NA - Tested for HIV as part of antenatal visit | object | 27821 | 27821 | 100.00 |
| NA - Tested for HIV between the time went for delivery and before baby was born | object | 27821 | 27821 | 100.00 |
| NA - Got results of HIV test as part of antenatal visit | object | 27821 | 27821 | 100.00 |
| NA - Got results of HIV test when tested before baby was born | object | 27821 | 27821 | 100.00 |
| NA - Place were HIV test was taken as part of antenatal visit | object | 27821 | 27821 | 100.00 |
| NA - Tested for HIV since antenatal visit test | object | 27821 | 27821 | 100.00 |
| Respondent can refuse sex | object | 27821 | 12299 | 44.21 |
| Respondent can ask partner to use a condom | object | 27821 | 12299 | 44.21 |
| NA - How long ago first had sex with most recent partner | object | 27821 | 27821 | 100.00 |
| NA - How long ago first had sex with 2nd most recent partner | object | 27821 | 27821 | 100.00 |
| NA - How long ago first had sex with 3rd most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Times in last 12 months had sex with most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Times in last 12 months had sex with 2nd most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Times in last 12 months had sex with 3rd most recent partner | object | 27821 | 27821 | 100.00 |
| NA - Received counseling after tested for AIDS during antenatal care | object | 27821 | 27821 | 100.00 |
| Knowledge and use of HIV test kits | object | 27821 | 2537 | 9.12 |
| Children with HIV should be allowed to attend school with children without HIV | object | 27821 | 2537 | 9.12 |
| NA - People hesitate to take HIV test because reaction of other people if positive | object | 27821 | 27821 | 100.00 |
| NA - People talk badly about people with or believed to have HIV | object | 27821 | 27821 | 100.00 |
| NA - People with or believed to have HIV lose respect from other people | object | 27821 | 27821 | 100.00 |
| NA - Would be afraid to get HIV from contact with saliva from infected person | object | 27821 | 27821 | 100.00 |
| NA - Knowledge and attitude to PrEP to prevent getting HIV | object | 27821 | 27821 | 100.00 |
| Month of most recent HIV test | object | 27821 | 25577 | 91.93 |
| Year of most recent HIV test | object | 27821 | 25577 | 91.93 |
| Date of most recent HIV test (CMC) | object | 27821 | 25577 | 91.93 |
| Result of HIV test | object | 27821 | 25738 | 92.51 |
| Month received first HIV test positive | object | 27821 | 27820 | 100.00 |
| Year received first HIV test positive | object | 27821 | 27820 | 100.00 |
| Date received first HIV test positive (CMC) | object | 27821 | 27820 | 100.00 |
| Currently taking ARVs | object | 27821 | 27820 | 100.00 |
| Number of HIV tests | object | 27821 | 25577 | 91.93 |
| Disclosed HIV status to others | object | 27821 | 27820 | 100.00 |
| Respondent feels ashamed of HIV status | object | 27821 | 27820 | 100.00 |
| Things happened because HIV positive status: people talk badly | object | 27821 | 27820 | 100.00 |
| Things happened because HIV positive status: someone else disclosed status | object | 27821 | 27820 | 100.00 |
| Things happened because HIV positive status: verbally insulted/harassed/threatened | object | 27821 | 27820 | 100.00 |
| Things happened because HIV positive status: healthcare workers talked badly | object | 27821 | 27820 | 100.00 |
| Things happened because HIV positive status: healthcare workers verbally abused | object | 27821 | 27820 | 100.00 |
Okay. Let's take this step by step. What insights can we gain from this?
We have 232 survey questions. That is a lot, and we won't actually need all of them. I will explain why later.
There are only two data types in the entire dataset: 64-Bit Integers and String Objects. Let's clean the Objects first before touching the Integers.
As suspected, there are several blank answers. All of these are actually strings containing a single space. These are what we need to filter out next.
Some columns don't have a single entry at all. This tells us something important: not every column should be treated equally. Some columns provide no value to us in this Project.
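The point about blank answers is easy to miss, so here is a minimal sketch on made-up data (the toy column names are hypothetical). A cell holding a single space is a perfectly valid string, so isnull() does not flag it, and the blanks have to be counted explicitly.

```python
import pandas as pd

toy_answers = pd.DataFrame({
    "text_answer": ["1", " ", "0", " "],  # object column with blank-space entries
    "int_answer": [1, 0, 1, 1],           # integer column, fully answered
})

# isnull() reports nothing missing, because " " is a valid string
null_count = toy_answers["text_answer"].isnull().sum()

# Counting the single-space strings reveals the blanks
blank_count = toy_answers["text_answer"].eq(" ").sum()

print(null_count, blank_count)
```

This is why the helper function above compares object columns against " " instead of relying on isnull() alone.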
In order for us to remove the missing entries, let's first replace all the empty spaces with NumPy NaN (Not a Number) values instead. This way, we can then simply use the dropna() function in order for us to delete these Null data points.
Here are our conditions. If more than half of a column's data is missing, we remove the entire column from our dataframe. Afterwards, any respondent that still has at least one missing answer also gets removed. For the purposes of the Project, doing both of these is much easier than trying to fill in the missing data points manually (though that is also a possibility). For now, let us see what we have accomplished so far.
# Replace all Empty Spaces with NaN Values
dataset_Proper_Copy.replace(" ", np.nan, inplace = True)
# Drop all Survey Questions with More than Half their Data Missing (27821 * 0.5 = 13910.5, so keep columns with at least 13911 non-null values)
dataset_Proper_Copy.dropna(axis = 1, thresh = 13911, inplace = True)
# Drop all Rows with at least One Missing Value
dataset_Proper_Copy.dropna(axis = 0, how = "any", inplace = True)
# Display the Newly Filtered Dataframe
dataFrame_Print_Missing_Info(dataset_Proper_Copy)
| | DataType | TotalCount | NullCount | NullPercent |
|---|---|---|---|---|
| Country code and phase | object | 12026 | 0 | 0.0 |
| Cluster number | int64 | 12026 | 0 | 0.0 |
| Household number | int64 | 12026 | 0 | 0.0 |
| Respondent's line number | int64 | 12026 | 0 | 0.0 |
| Ultimate area unit | int64 | 12026 | 0 | 0.0 |
| Women's individual sample weight (6 decimals) | int64 | 12026 | 0 | 0.0 |
| Month of interview | int64 | 12026 | 0 | 0.0 |
| Year of interview | int64 | 12026 | 0 | 0.0 |
| Date of interview (CMC) | int64 | 12026 | 0 | 0.0 |
| Date of interview Century Day Code (CDC) | int64 | 12026 | 0 | 0.0 |
| Respondent's month of birth | int64 | 12026 | 0 | 0.0 |
| Respondent's year of birth | int64 | 12026 | 0 | 0.0 |
| Date of birth (CMC) | int64 | 12026 | 0 | 0.0 |
| Respondent's current age | int64 | 12026 | 0 | 0.0 |
| Age in 5-year groups | int64 | 12026 | 0 | 0.0 |
| Completeness of age information | int64 | 12026 | 0 | 0.0 |
| Result of individual interview | int64 | 12026 | 0 | 0.0 |
| Day of interview | int64 | 12026 | 0 | 0.0 |
| CMC start of calendar | int64 | 12026 | 0 | 0.0 |
| Row of month of interview | int64 | 12026 | 0 | 0.0 |
| Length of calendar | int64 | 12026 | 0 | 0.0 |
| Number of calendar columns | int64 | 12026 | 0 | 0.0 |
| Ever-married sample | int64 | 12026 | 0 | 0.0 |
| Primary sampling unit | int64 | 12026 | 0 | 0.0 |
| Sample strata for sampling errors | int64 | 12026 | 0 | 0.0 |
| Stratification used in sample design | int64 | 12026 | 0 | 0.0 |
| Region | int64 | 12026 | 0 | 0.0 |
| Type of place of residence | int64 | 12026 | 0 | 0.0 |
| Number of visits | int64 | 12026 | 0 | 0.0 |
| Interviewer identification | int64 | 12026 | 0 | 0.0 |
| Field supervisor | int64 | 12026 | 0 | 0.0 |
| Line number of husband | object | 12026 | 0 | 0.0 |
| Cluster altitude in meters | int64 | 12026 | 0 | 0.0 |
| Household selected for hemoglobin | int64 | 12026 | 0 | 0.0 |
| Selected for Domestic Violence module | int64 | 12026 | 0 | 0.0 |
| Language of questionnaire | int64 | 12026 | 0 | 0.0 |
| Language of interview | int64 | 12026 | 0 | 0.0 |
| Native language of respondent | int64 | 12026 | 0 | 0.0 |
| Translator used | int64 | 12026 | 0 | 0.0 |
| Team number | int64 | 12026 | 0 | 0.0 |
| Team supervisor | int64 | 12026 | 0 | 0.0 |
| Region | int64 | 12026 | 0 | 0.0 |
| Type of place of residence | int64 | 12026 | 0 | 0.0 |
| Years lived in place of residence | int64 | 12026 | 0 | 0.0 |
| Highest educational level | int64 | 12026 | 0 | 0.0 |
| Highest year of education | object | 12026 | 0 | 0.0 |
| Ever heard of a Sexually Transmitted Infection (STI) | int64 | 12026 | 0 | 0.0 |
| Ever heard of AIDS | int64 | 12026 | 0 | 0.0 |
| Condom used during last sex with most recent partner | object | 12026 | 0 | 0.0 |
| Had any STI in last 12 months | int64 | 12026 | 0 | 0.0 |
| Had genital sore/ulcer in last 12 months | int64 | 12026 | 0 | 0.0 |
| Had genital discharge in last 12 months | int64 | 12026 | 0 | 0.0 |
| Number of sex partners, excluding spouse, in last 12 months | int64 | 12026 | 0 | 0.0 |
| Number of sex partners, including spouse, in last 12 months | int64 | 12026 | 0 | 0.0 |
| Relationship with most recent sex partner | object | 12026 | 0 | 0.0 |
| Ever been tested for HIV | int64 | 12026 | 0 | 0.0 |
| Heard about other STIs | int64 | 12026 | 0 | 0.0 |
| Used a method at last sexual intercourse | object | 12026 | 0 | 0.0 |
| Wife justified asking husband to use condom if he has STI | int64 | 12026 | 0 | 0.0 |
| Drugs to avoid HIV transmission to baby during pregnancy | object | 12026 | 0 | 0.0 |
| Would buy vegetables from vendor with HIV | object | 12026 | 0 | 0.0 |
| Total lifetime number of sex partners | object | 12026 | 0 | 0.0 |
| Heard of ARVs to treat HIV | object | 12026 | 0 | 0.0 |
| Respondent can refuse sex | object | 12026 | 0 | 0.0 |
| Respondent can ask partner to use a condom | object | 12026 | 0 | 0.0 |
| Knowledge and use of HIV test kits | object | 12026 | 0 | 0.0 |
| Children with HIV should be allowed to attend school with children without HIV | object | 12026 | 0 | 0.0 |
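The three filtering steps used above can be tried on a small made-up frame. This is a minimal sketch, not the notebook's own code: the toy column names are hypothetical, and the threshold is computed with math.ceil instead of being typed in by hand, which is an assumption about how one might generalize the hard-coded 13911.

```python
import math

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_answered": ["1", "0", " ", "1", "1"],
    "mostly_blank": [" ", " ", " ", "1", " "],
    "fully_answered": [3, 4, 5, 6, 7],
})

# Step 1: turn blank-space strings into real missing values
toy_clean = toy.replace(" ", np.nan)

# Step 2: keep only columns with at least half of their values present
min_non_null = math.ceil(len(toy_clean) * 0.5)  # 5 rows -> 3
toy_clean = toy_clean.dropna(axis=1, thresh=min_non_null)

# Step 3: drop every row that still has a missing value
toy_clean = toy_clean.dropna(axis=0, how="any")

print(toy_clean)
```

For the real dataframe, math.ceil(27821 * 0.5) gives 13911, the same threshold used above. Note that thresh counts non-null values, so it states how many answers a column must have to be kept, not how many it may be missing.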
DROP IRRELEVANT SURVEY QUESTIONS
We have now significantly reduced the total number of columns in our dataframe. However, we are not done yet. As I said earlier, we don't need all of these details. Here is a list of columns that we will be deleting because they are unnecessary for the Project. Let's drop them all using the drop() function.
# Curate a List of the Columns to be Deleted
unnecessary_Columns = [
"Country code and phase",
"Cluster number",
"Household number",
"Respondent's line number",
"Ultimate area unit",
"Women's individual sample weight (6 decimals)",
"Month of interview",
"Year of interview",
"Date of interview (CMC)",
"Date of interview Century Day Code (CDC)",
"Date of birth (CMC)",
"Completeness of age information",
"Result of individual interview",
"Day of interview",
"CMC start of calendar",
"Row of month of interview",
"Length of calendar",
"Number of calendar columns",
"Ever-married sample",
"Primary sampling unit",
"Sample strata for sampling errors",
"Stratification used in sample design",
"Number of visits",
"Interviewer identification",
"Field supervisor",
"Line number of husband",
"Cluster altitude in meters",
"Household selected for hemoglobin",
"Selected for Domestic Violence module",
"Years lived in place of residence",
"Team number",
"Team supervisor"
]
# Drop all the Irrelevant Columns
dataset_Proper_Copy.drop(columns = unnecessary_Columns, inplace = True)
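As a defensive variation on the step above, the sketch below drops only the listed columns that are actually present, so that re-running the cell does not raise a KeyError. The toy dataframe and the helper name `drop_existing_columns` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def drop_existing_columns(dataframe: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    # Hypothetical helper: keep only the names that really exist in the dataframe
    present = [col for col in columns if col in dataframe.columns]
    # Return a new dataframe instead of modifying the original in place
    return dataframe.drop(columns=present)

# Toy data standing in for the survey dataframe (assumed values)
toy = pd.DataFrame({"Cluster number": [1, 2], "Region": [13, 7], "Team number": [4, 4]})
trimmed = drop_existing_columns(toy, ["Cluster number", "Team number", "Not a real column"])
print(list(trimmed.columns))
```

Returning a new dataframe rather than using `inplace = True` is only a design preference here; it keeps the original available for comparison.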
REMOVE DUPLICATES AND RESET INDICES
Next, let's do the following:
- Remove all Duplicate Columns.
- Reset the Row Indices.
# Remove all Duplicate Columns
dataset_Proper_Copy = dataset_Proper_Copy.loc[:, ~dataset_Proper_Copy.columns.duplicated()]
# Reset the Row Indices
dataset_Proper_Copy.reset_index(drop = True, inplace = True)
# Let us view our changes
dataset_Proper_Copy.head()
| | Respondent's month of birth | Respondent's year of birth | ... | Knowledge and use of HIV test kits | Children with HIV should be allowed to attend school with children without HIV |
|---|---|---|---|---|---|
| 0 | 8 | 1977 | ... | 0 | 8 |
| 1 | 9 | 2000 | ... | 0 | 8 |
| 2 | 7 | 1993 | ... | 0 | 1 |
| 3 | 7 | 1980 | ... | 0 | 0 |
| 4 | 5 | 1982 | ... | 0 | 8 |
5 rows × 33 columns
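To see what the duplicate-column filter does, here is a minimal sketch on a toy dataframe with a repeated label. The column names and values are assumptions chosen only to make the mask visible.

```python
import pandas as pd

# Toy dataframe with a repeated column label (assumed values)
toy = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Region", "Age", "Region"])

# columns.duplicated() flags the second and later occurrences of each label
duplicate_mask = toy.columns.duplicated()
print(duplicate_mask.tolist())

# Inverting the mask keeps only the first occurrence of every label
deduplicated = toy.loc[:, ~duplicate_mask]
print(list(deduplicated.columns))
```

Note that the check is on column labels only; two columns with different names but identical values would both be kept.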
CONVERT DATATYPES
Finally, it's time to deal with the Datatypes. Ideally, we would want to work with a single data type all throughout the entire dataframe. More specifically, since our Project focuses on finding a numerical relationship between certain Features and overall HIV Perception, we would like to work with Integers.
However, since this is a Survey Questionnaire, I suspect that every single String entry here can in fact be encoded as an Integer value. Let's investigate this theory. What exactly do all the Object Columns really contain?
# Find the actual Values each Object Column contains
for col in dataset_Proper_Copy.columns:
    if dataset_Proper_Copy[col].dtype == "object":
        display(dataset_Proper_Copy[col].value_counts())
Highest year of education
4     4694
6     3347
2     1412
3     1110
1     1004
5      429
8       20
98       6
7        4
Name: count, dtype: int64

Condom used during last sex with most recent partner
0    11737
1      289
Name: count, dtype: int64

Relationship with most recent sex partner
1    8508
7    3506
2      10
4       2
Name: count, dtype: int64

Used a method at last sexual intercourse
1    7071
0    4955
Name: count, dtype: int64

Drugs to avoid HIV transmission to baby during pregnancy
1    6465
0    3537
8    2024
Name: count, dtype: int64

Would buy vegetables from vendor with HIV
0    6965
1    4345
8     716
Name: count, dtype: int64

Total lifetime number of sex partners
1     9481
2     1953
3      405
4       91
5       54
98      10
6        9
8        7
95       4
10       4
7        3
15       3
13       1
20       1
Name: count, dtype: int64

Heard of ARVs to treat HIV
0    8007
1    4019
Name: count, dtype: int64

Respondent can refuse sex
1    11003
0      896
8      127
Name: count, dtype: int64

Respondent can ask partner to use a condom
1    9073
0    2473
8     480
Name: count, dtype: int64

Knowledge and use of HIV test kits
0    9635
2    2303
1      88
Name: count, dtype: int64

Children with HIV should be allowed to attend school with children without HIV
0    6614
1    4571
8     841
Name: count, dtype: int64
My hunch was correct. These are Strings that can easily be converted into actual Integer values. Using the astype() function, let's change the data type of all the Object Columns into 64-Bit Integers.
# Convert each Object Column into an Integer Column
for col in dataset_Proper_Copy.columns:
    if dataset_Proper_Copy[col].dtype == "object":
        dataset_Proper_Copy[col] = dataset_Proper_Copy[col].astype(np.int64)
# Display the Newly Filtered Dataframe
dataFrame_Print_Missing_Info(dataset_Proper_Copy)
| | DataType | TotalCount | NullCount | NullPercent |
|---|---|---|---|---|
| Respondent's month of birth | int64 | 12026 | 0 | 0.0 |
| Respondent's year of birth | int64 | 12026 | 0 | 0.0 |
| Respondent's current age | int64 | 12026 | 0 | 0.0 |
| Age in 5-year groups | int64 | 12026 | 0 | 0.0 |
| Region | int64 | 12026 | 0 | 0.0 |
| Type of place of residence | int64 | 12026 | 0 | 0.0 |
| Language of questionnaire | int64 | 12026 | 0 | 0.0 |
| Language of interview | int64 | 12026 | 0 | 0.0 |
| Native language of respondent | int64 | 12026 | 0 | 0.0 |
| Translator used | int64 | 12026 | 0 | 0.0 |
| Highest educational level | int64 | 12026 | 0 | 0.0 |
| Highest year of education | int64 | 12026 | 0 | 0.0 |
| Ever heard of a Sexually Transmitted Infection (STI) | int64 | 12026 | 0 | 0.0 |
| Ever heard of AIDS | int64 | 12026 | 0 | 0.0 |
| Condom used during last sex with most recent partner | int64 | 12026 | 0 | 0.0 |
| Had any STI in last 12 months | int64 | 12026 | 0 | 0.0 |
| Had genital sore/ulcer in last 12 months | int64 | 12026 | 0 | 0.0 |
| Had genital discharge in last 12 months | int64 | 12026 | 0 | 0.0 |
| Number of sex partners, excluding spouse, in last 12 months | int64 | 12026 | 0 | 0.0 |
| Number of sex partners, including spouse, in last 12 months | int64 | 12026 | 0 | 0.0 |
| Relationship with most recent sex partner | int64 | 12026 | 0 | 0.0 |
| Ever been tested for HIV | int64 | 12026 | 0 | 0.0 |
| Heard about other STIs | int64 | 12026 | 0 | 0.0 |
| Used a method at last sexual intercourse | int64 | 12026 | 0 | 0.0 |
| Wife justified asking husband to use condom if he has STI | int64 | 12026 | 0 | 0.0 |
| Drugs to avoid HIV transmission to baby during pregnancy | int64 | 12026 | 0 | 0.0 |
| Would buy vegetables from vendor with HIV | int64 | 12026 | 0 | 0.0 |
| Total lifetime number of sex partners | int64 | 12026 | 0 | 0.0 |
| Heard of ARVs to treat HIV | int64 | 12026 | 0 | 0.0 |
| Respondent can refuse sex | int64 | 12026 | 0 | 0.0 |
| Respondent can ask partner to use a condom | int64 | 12026 | 0 | 0.0 |
| Knowledge and use of HIV test kits | int64 | 12026 | 0 | 0.0 |
| Children with HIV should be allowed to attend school with children without HIV | int64 | 12026 | 0 | 0.0 |
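The conversion above works because every entry happened to be a clean digit string. As a more cautious variation, the sketch below uses `pd.to_numeric` with `errors = "coerce"` to surface entries that would make `astype()` fail. The toy column and its `"unknown"` entry are assumptions for illustration; nothing in the dataset suggests such an entry exists.

```python
import pandas as pd

# Toy object column that looks numeric but hides one non-numeric entry (assumed values)
toy = pd.DataFrame({"Heard of ARVs to treat HIV": ["0", "1", "1", "unknown"]})

# errors="coerce" turns anything unparseable into NaN instead of raising an error
converted = pd.to_numeric(toy["Heard of ARVs to treat HIV"], errors="coerce")

# Count the entries that failed to convert before deciding how to handle them
failed_count = int(converted.isna().sum())
print(failed_count)
```

If the count is zero, a plain `astype(np.int64)` as used in this notebook is safe.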
NUMERICAL SUMMARY OF THE DATAFRAME
We have now finished the actual cleaning part of the Project. From here on out, up until the Data Exploration Section, we will be focusing on enhancing the Proper Dataframe more than anything else.
To start, let's take a step back and observe what we have done. Let's determine the high-level, numerical summary of the data that we are dealing with so far.
def dataFrame_Print_Numerical_Description(dataframe: pd.DataFrame):
    # Create a new Dataframe to hold our Numerical Description
    description_dataFrame = pd.DataFrame(index = dataframe.columns)
    # Determine the Min, Mean, Median, Mode, and Max Values for each Column
    for col in dataframe.columns:
        description_dataFrame.loc[col, "MIN"] = dataframe[col].min().round(2)
        description_dataFrame.loc[col, "MEAN"] = dataframe[col].mean().round(2)
        description_dataFrame.loc[col, "MEDIAN"] = dataframe[col].median().round(2)
        description_dataFrame.loc[col, "MODE"] = dataframe[col].mode().round(2).iloc[0]
        description_dataFrame.loc[col, "MAX"] = dataframe[col].max().round(2)
    # Display a Numerical Description of the Dataframe
    display(description_dataFrame)
dataFrame_Print_Numerical_Description(dataset_Proper_Copy)
| | MIN | MEAN | MEDIAN | MODE | MAX |
|---|---|---|---|---|---|
| Respondent's month of birth | 1.0 | 6.66 | 7.0 | 12.0 | 12.0 |
| Respondent's year of birth | 1972.0 | 1985.55 | 1985.0 | 1979.0 | 2007.0 |
| Respondent's current age | 15.0 | 35.87 | 36.0 | 42.0 | 49.0 |
| Age in 5-year groups | 1.0 | 4.76 | 5.0 | 6.0 | 7.0 |
| Region | 1.0 | 9.09 | 9.0 | 13.0 | 17.0 |
| Type of place of residence | 1.0 | 1.60 | 2.0 | 2.0 | 2.0 |
| Language of questionnaire | 1.0 | 3.94 | 2.0 | 2.0 | 7.0 |
| Language of interview | 1.0 | 8.08 | 3.0 | 2.0 | 96.0 |
| Native language of respondent | 1.0 | 16.87 | 6.0 | 2.0 | 96.0 |
| Translator used | 0.0 | 0.08 | 0.0 | 0.0 | 1.0 |
| Highest educational level | 1.0 | 2.22 | 2.0 | 2.0 | 3.0 |
| Highest year of education | 1.0 | 4.07 | 4.0 | 4.0 | 98.0 |
| Ever heard of a Sexually Transmitted Infection (STI) | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| Ever heard of AIDS | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| Condom used during last sex with most recent partner | 0.0 | 0.02 | 0.0 | 0.0 | 1.0 |
| Had any STI in last 12 months | 0.0 | 0.01 | 0.0 | 0.0 | 8.0 |
| Had genital sore/ulcer in last 12 months | 0.0 | 0.05 | 0.0 | 0.0 | 8.0 |
| Had genital discharge in last 12 months | 0.0 | 0.08 | 0.0 | 0.0 | 8.0 |
| Number of sex partners, excluding spouse, in last 12 months | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| Number of sex partners, including spouse, in last 12 months | 1.0 | 1.03 | 1.0 | 1.0 | 98.0 |
| Relationship with most recent sex partner | 1.0 | 2.75 | 1.0 | 1.0 | 7.0 |
| Ever been tested for HIV | 0.0 | 0.12 | 0.0 | 0.0 | 1.0 |
| Heard about other STIs | 0.0 | 0.37 | 0.0 | 0.0 | 1.0 |
| Used a method at last sexual intercourse | 0.0 | 0.59 | 1.0 | 1.0 | 1.0 |
| Wife justified asking husband to use condom if he has STI | 0.0 | 0.98 | 1.0 | 1.0 | 8.0 |
| Drugs to avoid HIV transmission to baby during pregnancy | 0.0 | 1.88 | 1.0 | 1.0 | 8.0 |
| Would buy vegetables from vendor with HIV | 0.0 | 0.84 | 0.0 | 0.0 | 8.0 |
| Total lifetime number of sex partners | 1.0 | 1.40 | 1.0 | 1.0 | 98.0 |
| Heard of ARVs to treat HIV | 0.0 | 0.33 | 0.0 | 0.0 | 1.0 |
| Respondent can refuse sex | 0.0 | 1.00 | 1.0 | 1.0 | 8.0 |
| Respondent can ask partner to use a condom | 0.0 | 1.07 | 1.0 | 1.0 | 8.0 |
| Knowledge and use of HIV test kits | 0.0 | 0.39 | 0.0 | 0.0 | 2.0 |
| Children with HIV should be allowed to attend school with children without HIV | 0.0 | 0.94 | 0.0 | 0.0 | 8.0 |
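The same kind of summary can also be built without an explicit loop. The sketch below is one possible alternative using `DataFrame.agg()`, shown on a toy dataframe with assumed values; it is not the function used in this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned survey dataframe (assumed values)
toy = pd.DataFrame({"Region": [1, 13, 13, 17], "Ever been tested for HIV": [0, 0, 1, 0]})

# agg() computes several statistics per column; transpose so that columns become rows
summary = toy.agg(["min", "mean", "median", "max"]).T
# mode() can return several rows when there are ties, so take the first one
summary["mode"] = toy.mode().iloc[0]
summary = summary.round(2)
print(summary)
```

Keep in mind that special answer codes such as 8 or 98 are still counted as ordinary numbers here, which is why several MAX and MEAN values in the table above look inflated.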
THESE ARE ALRIGHT... BUT WE CAN DO MORE.
Technically speaking, we can already start analyzing the data as it already is. However, there are a few more things I would like to modify here before wrapping everything up:
First of all, while it is true that all of these Survey Questions would be useful to us in one way or another, I think we ought to only keep the columns that are most relevant to the Research Questions. For instance, while it is great to know the month and year of a respondent's birth, I think knowing her age would be more than enough information for us.
Second, while most of these Survey Questions have either a "YES" or a "NO" answer to them, there are a few items that allow for either an "I DON'T KNOW" answer or different variations of "YES" and "NO". To keep it clean and simple, and since we are talking about HIV Perception anyway, let's make all "I DON'T KNOW" answers (coded as 8) count as 0's and fold the remaining answer variations (coded as 2) into 1's. Also, some columns need to be flipped, as saying "YES" could mean a bad thing and vice versa.
Let's first split the Proper Dataframe into six sub-components after dropping the columns: Age, Region, Residence, Language, Educational Level, and Perception. Afterwards, let's go on and finish this Section from here.
# Curate a List of the Additional Columns to be Deleted
unnecessary_Columns = [
"Respondent's current age",
"Respondent's month of birth",
"Respondent's year of birth",
"Language of questionnaire",
"Language of interview",
"Translator used",
"Highest year of education",
"Number of sex partners, excluding spouse, in last 12 months",
"Number of sex partners, including spouse, in last 12 months",
"Relationship with most recent sex partner",
"Total lifetime number of sex partners"
]
# Drop all the Irrelevant Columns
dataset_Proper_Copy.drop(columns = unnecessary_Columns, inplace = True)
# Split the Dataframe into the Six Sub-Components
age_dataFrame = dataset_Proper_Copy["Age in 5-year groups"]
region_dataFrame = dataset_Proper_Copy["Region"]
residence_dataFrame = dataset_Proper_Copy["Type of place of residence"]
language_dataFrame = dataset_Proper_Copy["Native language of respondent"]
educLevel_dataFrame = dataset_Proper_Copy["Highest educational level"]
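# Take an explicit copy so that the in-place edits below do not act on a view of the original
perception_dataFrame = dataset_Proper_Copy.iloc[:, 5:22].copy()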
# Replace the Extra Perception Dataframe Values
perception_dataFrame.replace(8, 0, inplace = True)
perception_dataFrame.replace(2, 1, inplace = True)
# Switch the Values of these Columns
cols_to_flip = ["Had any STI in last 12 months", "Had genital sore/ulcer in last 12 months", "Had genital discharge in last 12 months"]
perception_dataFrame[cols_to_flip] = perception_dataFrame[cols_to_flip].map(lambda x: 0 if x == 1 else 1)
# Concatenate all Six Dataframes back into just One
list_of_sixDataFrames = [age_dataFrame, region_dataFrame, residence_dataFrame, language_dataFrame, educLevel_dataFrame, perception_dataFrame]
merged_dataFrame = pd.concat(list_of_sixDataFrames, axis = 1)
# Display the Numerical Description of the new Dataframe
dataFrame_Print_Numerical_Description(merged_dataFrame)
| | MIN | MEAN | MEDIAN | MODE | MAX |
|---|---|---|---|---|---|
| Age in 5-year groups | 1.0 | 4.76 | 5.0 | 6.0 | 7.0 |
| Region | 1.0 | 9.09 | 9.0 | 13.0 | 17.0 |
| Type of place of residence | 1.0 | 1.60 | 2.0 | 2.0 | 2.0 |
| Native language of respondent | 1.0 | 16.87 | 6.0 | 2.0 | 96.0 |
| Highest educational level | 1.0 | 2.22 | 2.0 | 2.0 | 3.0 |
| Ever heard of a Sexually Transmitted Infection (STI) | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| Ever heard of AIDS | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| Condom used during last sex with most recent partner | 0.0 | 0.02 | 0.0 | 0.0 | 1.0 |
| Had any STI in last 12 months | 0.0 | 0.99 | 1.0 | 1.0 | 1.0 |
| Had genital sore/ulcer in last 12 months | 0.0 | 0.97 | 1.0 | 1.0 | 1.0 |
| Had genital discharge in last 12 months | 0.0 | 0.94 | 1.0 | 1.0 | 1.0 |
| Ever been tested for HIV | 0.0 | 0.12 | 0.0 | 0.0 | 1.0 |
| Heard about other STIs | 0.0 | 0.37 | 0.0 | 0.0 | 1.0 |
| Used a method at last sexual intercourse | 0.0 | 0.59 | 1.0 | 1.0 | 1.0 |
| Wife justified asking husband to use condom if he has STI | 0.0 | 0.78 | 1.0 | 1.0 | 1.0 |
| Drugs to avoid HIV transmission to baby during pregnancy | 0.0 | 0.54 | 1.0 | 1.0 | 1.0 |
| Would buy vegetables from vendor with HIV | 0.0 | 0.36 | 0.0 | 0.0 | 1.0 |
| Heard of ARVs to treat HIV | 0.0 | 0.33 | 0.0 | 0.0 | 1.0 |
| Respondent can refuse sex | 0.0 | 0.91 | 1.0 | 1.0 | 1.0 |
| Respondent can ask partner to use a condom | 0.0 | 0.75 | 1.0 | 1.0 | 1.0 |
| Knowledge and use of HIV test kits | 0.0 | 0.20 | 0.0 | 0.0 | 1.0 |
| Children with HIV should be allowed to attend school with children without HIV | 0.0 | 0.38 | 0.0 | 0.0 | 1.0 |
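The recoding and flipping steps can be checked on a small example. The sketch below assumes the coding 0 = no, 1 = yes, 2 = a variation of yes, and 8 = don't know, and uses made-up answers; it collapses the special codes with a single mapping and flips a column with `1 - x`, which is an equivalent shortcut once the values are binary.

```python
import pandas as pd

# Toy answers: 0 = no, 1 = yes, 2 = a variation of yes, 8 = don't know (assumed coding)
toy = pd.DataFrame({
    "Heard of ARVs to treat HIV": [0, 1, 8, 2],
    "Had any STI in last 12 months": [0, 1, 8, 0],
})

# Collapse both special codes in a single pass using a mapping
recoded = toy.replace({8: 0, 2: 1})

# Flip the columns where a "yes" is the unfavorable answer: 1 - x swaps 0 and 1
cols_to_flip = ["Had any STI in last 12 months"]
recoded[cols_to_flip] = 1 - recoded[cols_to_flip]
print(recoded)
```

One consequence worth noticing: because "don't know" becomes 0 before the flip, it ends up as 1 in the flipped columns, which matches the behavior of the notebook's own code.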
III. DATA EXPLORATION
DETERMINE THE OVERALL PERCEPTION MEAN
Now, let's finally start exploring and analyzing our data. There are two things I would like to do first:
- Let's rename some of these columns so that they become more concise and easier to read.
- Let's try to quantify the "Perception" of a Respondent.
There are different ways to measure perception. One is simply to take the Average across all the Survey Questions. This is quick and easy but, of course, there are other ways too (a Weighted Average, for example). For now, let's use the simple Average and see what we get.
# Create a Dictionary of the Column Names to be Replaced
rename_toLessWords_Dict = {
"Ever heard of a Sexually Transmitted Infection (STI)": "Ever heard of a STI",
"Condom used during last sex with most recent partner": "Condom used during last sex",
"Had any STI in last 12 months": "Had any STI in last year",
"Had genital sore/ulcer in last 12 months": "Had genital sore/ulcer last year",
"Had genital discharge in last 12 months": "Had genital discharge last year",
"Used a method at last sexual intercourse": "Used a method last sex",
"Wife justified asking husband to use condom if he has STI": "Justified asking husband with STI to use condom",
"Drugs to avoid HIV transmission to baby during pregnancy": "Knows Anti-HIV transmission to baby Drugs",
"Would buy vegetables from vendor with HIV": "Would buy vegetables from vendor with HIV",
"Heard of ARVs to treat HIV": "Heard of ARVs",
"Respondent can ask partner to use a condom": "Can ask to use condom",
"Knowledge and use of HIV test kits": "Knowledge of HIV test kits",
"Children with HIV should be allowed to attend school with children without HIV": "Kids w/ HIV allowed schooling w/ Kids w/o HIV"
}
# Rename the Specified Columns
merged_dataFrame.rename(columns = rename_toLessWords_Dict, inplace = True)
# Create a new Column for the Average of the Survey Question Values
merged_dataFrame["Perception Mean"] = perception_dataFrame.mean(axis = 1)
percMean_dataFrame = merged_dataFrame["Perception Mean"]
# Display the Numerical Description of the new Dataframe
dataFrame_Print_Numerical_Description(merged_dataFrame)
| | MIN | MEAN | MEDIAN | MODE | MAX |
|---|---|---|---|---|---|
| Age in 5-year groups | 1.00 | 4.76 | 5.00 | 6.00 | 7.0 |
| Region | 1.00 | 9.09 | 9.00 | 13.00 | 17.0 |
| Type of place of residence | 1.00 | 1.60 | 2.00 | 2.00 | 2.0 |
| Native language of respondent | 1.00 | 16.87 | 6.00 | 2.00 | 96.0 |
| Highest educational level | 1.00 | 2.22 | 2.00 | 2.00 | 3.0 |
| Ever heard of a STI | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 |
| Ever heard of AIDS | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 |
| Condom used during last sex | 0.00 | 0.02 | 0.00 | 0.00 | 1.0 |
| Had any STI in last year | 0.00 | 0.99 | 1.00 | 1.00 | 1.0 |
| Had genital sore/ulcer last year | 0.00 | 0.97 | 1.00 | 1.00 | 1.0 |
| Had genital discharge last year | 0.00 | 0.94 | 1.00 | 1.00 | 1.0 |
| Ever been tested for HIV | 0.00 | 0.12 | 0.00 | 0.00 | 1.0 |
| Heard about other STIs | 0.00 | 0.37 | 0.00 | 0.00 | 1.0 |
| Used a method last sex | 0.00 | 0.59 | 1.00 | 1.00 | 1.0 |
| Justified asking husband with STI to use condom | 0.00 | 0.78 | 1.00 | 1.00 | 1.0 |
| Knows Anti-HIV transmission to baby Drugs | 0.00 | 0.54 | 1.00 | 1.00 | 1.0 |
| Would buy vegetables from vendor with HIV | 0.00 | 0.36 | 0.00 | 0.00 | 1.0 |
| Heard of ARVs | 0.00 | 0.33 | 0.00 | 0.00 | 1.0 |
| Respondent can refuse sex | 0.00 | 0.91 | 1.00 | 1.00 | 1.0 |
| Can ask to use condom | 0.00 | 0.75 | 1.00 | 1.00 | 1.0 |
| Knowledge of HIV test kits | 0.00 | 0.20 | 0.00 | 0.00 | 1.0 |
| Kids w/ HIV allowed schooling w/ Kids w/o HIV | 0.00 | 0.38 | 0.00 | 0.00 | 1.0 |
| Perception Mean | 0.18 | 0.60 | 0.59 | 0.59 | 1.0 |
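The Weighted Average mentioned as an alternative could look like the sketch below. The question names, answers, and weights are all hypothetical; the notebook itself uses the plain Average, and choosing real weights would need a justification that is outside the scope of this example.

```python
import numpy as np
import pandas as pd

# Toy binary answers for three questions (assumed values)
toy = pd.DataFrame({"Q1": [1, 0, 1], "Q2": [0, 0, 1], "Q3": [1, 1, 1]})

# Hypothetical weights: here Q3 counts twice as much as the other questions
weights = np.array([1.0, 1.0, 2.0])

# np.average divides the weighted sum of each row by the sum of the weights
weighted_mean = pd.Series(np.average(toy.to_numpy(), axis=1, weights=weights), index=toy.index)
print(weighted_mean.round(2).tolist())
```

With all weights equal, this reduces to the same value as `mean(axis = 1)`.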
FIND THE PERCEPTION MEAN GROUPED BY FEATURE
Let's see how Age, Region, Residence, Language, and Educational Level all correspond to the Perception Mean.
# Curate the List of Features
list_of_featureStrings = ["Age in 5-year groups", "Region", "Type of place of residence", "Native language of respondent", "Highest educational level"]
# Concatenate all the Dataframe Groups into One
list_of_groupDataFrames = [age_dataFrame, region_dataFrame, residence_dataFrame, language_dataFrame, educLevel_dataFrame, percMean_dataFrame]
merged_groups_dataFrame = pd.concat(list_of_groupDataFrames, axis = 1)
def dataFrame_Print_meanPerception_byGroup(dataframe: pd.DataFrame, feature: str, targetstring: str):
    # Determine the Perception Mean (Count, Min, and Max included) for the Feature
    meanPerception_dataFrame = dataframe.groupby(feature)[targetstring].agg([
        ("PERCEPTION COUNT", "count"),
        ("PERCEPTION MEAN", lambda x: round(x.mean(), 4)),
        ("PERCEPTION MIN", lambda x: round(x.min(), 4)),
        ("PERCEPTION MAX", lambda x: round(x.max(), 4))
    ])
    # Display the Feature Group
    display(meanPerception_dataFrame)
dataFrame_Print_meanPerception_byGroup(merged_groups_dataFrame, "Highest educational level", "Perception Mean")
| Highest educational level | PERCEPTION COUNT | PERCEPTION MEAN | PERCEPTION MIN | PERCEPTION MAX |
|---|---|---|---|---|
| 1 | 1807 | 0.5495 | 0.1765 | 0.8824 |
| 2 | 5825 | 0.5967 | 0.2353 | 1.0000 |
| 3 | 4394 | 0.6366 | 0.2353 | 1.0000 |
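The same grouping can be repeated for the other features in a loop. The sketch below is a minimal illustration on a toy dataframe with assumed values; it uses named aggregation with built-in aggregator names instead of lambdas, which is one possible way to write it rather than the notebook's own function.

```python
import pandas as pd

# Toy stand-in for merged_groups_dataFrame (assumed values)
toy = pd.DataFrame({
    "Highest educational level": [1, 1, 2, 2, 3],
    "Type of place of residence": [1, 2, 2, 2, 1],
    "Perception Mean": [0.4, 0.6, 0.5, 0.7, 0.9],
})

summaries = {}
for feature in ["Highest educational level", "Type of place of residence"]:
    # Named aggregation gives each output column an explicit label
    summaries[feature] = toy.groupby(feature)["Perception Mean"].agg(
        count="count", mean="mean", min="min", max="max"
    ).round(4)

print(summaries["Highest educational level"])
```

Storing the results in a dictionary keyed by feature name makes it easy to display or compare them later.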
IV. DATA VISUALIZATION
GRAPH THE PERCEPTION MEAN BOX PLOT
Let's visualize the Perception Mean Distribution by Educational Levels using a Box Plot.
# Define the Label Name Mapping
educLevel_group_mapping = {
1: "Primary",
2: "Secondary",
3: "Higher"
}
merged_groups_dataFrame["Highest Educational Level"] = merged_groups_dataFrame["Highest educational level"].map(educLevel_group_mapping)
educLevel_groups_order = ["Primary", "Secondary", "Higher"]
def add_box_traces(fig, df, educLevel_group_column, perception_mean_column, educLevel_groups):
    # Add a box trace for each educational level group to the figure
    for educLevel_group in educLevel_groups:
        fig.add_trace(go.Box(
            y = df[df[educLevel_group_column] == educLevel_group][perception_mean_column],
            name = str(educLevel_group),
            boxpoints = "outliers",
            jitter = 0.5,
            whiskerwidth = 0.2,
            marker = dict(size = 2),
            line = dict(width = 1),
        ))
    return fig
def add_horizontal_line(fig):
    # Add a horizontal dashed reference line at the midpoint (0.5) of the y-axis
    fig.add_shape(
        type = "line",
        x0 = -1,
        y0 = 0.5,
        x1 = 3,
        y1 = 0.5,
        yref = "paper",
        line = dict(color = "White", width = 2, dash = "dash"),
        opacity = 0.5
    )
    return fig
def update_axes(fig, educLevel_group_column, perception_mean_column):
    # Update the x and y axes of the figure
    fig.update_xaxes(title_text = educLevel_group_column, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", title_standoff = 25)
    fig.update_yaxes(title_text = perception_mean_column, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", range = [0, 1], title_standoff = 25)
    return fig
def update_layout(fig):
    # Update the layout of the figure
    fig.update_layout(
        title_text = "<b>PERCEPTION MEAN BOX PLOT</b>",
        title_font_family = "Roboto",
        title_font_size = 40,
        title_x = 0.5,
        font_family = "Roboto",
        font_size = 20,
        width = 1000,
        height = 500,
        template = "plotly_dark",
        margin = dict(l = 200, r = 250, t = 200, b = 150)
    )
    return fig
def create_box_plot(df, educLevel_group_column, perception_mean_column, educLevel_groups):
    # Create a Box Plot with the given DataFrame and Columns
    fig = go.Figure()
    fig = add_box_traces(fig, df, educLevel_group_column, perception_mean_column, educLevel_groups)
    fig = add_horizontal_line(fig)
    fig = update_axes(fig, educLevel_group_column, perception_mean_column)
    fig = update_layout(fig)
    fig.show()
# Display the figure
create_box_plot(merged_groups_dataFrame, "Highest Educational Level", "Perception Mean", educLevel_groups_order)
GRAPH THE PERCEPTION MEAN NORMAL DISTRIBUTION
Let's fit a Normal Distribution to the Perception Mean of each Educational Level and visualize the resulting curves.
# Define the Label Name Mapping
educLevel_group_mapping = {
1: " Primary",
2: " Secondary",
3: " Higher"
}
merged_groups_dataFrame["Highest Educational Level"] = merged_groups_dataFrame["Highest educational level"].map(educLevel_group_mapping)
educLevel_groups_order = [" Primary", " Secondary", " Higher"]
# Import the stats submodule explicitly (a bare "import scipy as sp" may not load it on older SciPy versions)
from scipy import stats
def add_scatter_traces(fig, df, group_column, value_column, group_order):
    # Fit a Normal Distribution to each group and add its density curve as a line trace
    for group in group_order:
        group_data = df[df[group_column] == group][value_column]
        mu, std = stats.norm.fit(group_data)
        x = np.linspace(min(group_data), max(group_data), 100)
        p = stats.norm.pdf(x, mu, std)
        fig.add_trace(go.Scatter(x = x, y = p, mode = "lines", name = str(group)))
    return fig
def add_vertical_line(fig):
    # Add a vertical dashed reference line at the midpoint (0.5) of the x-axis
    fig.add_shape(
        type = "line",
        x0 = 0.5,
        y0 = 0,
        x1 = 0.5,
        y1 = 1,
        yref = "paper",
        line = dict(color = "White", width = 2, dash = "dash"),
        opacity = 0.5
    )
    return fig
def update_axes(fig, x_title, y_title):
    # Update the x and y axes of the figure
    fig.update_xaxes(title_text = x_title, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", range = [0, 1], title_standoff = 25)
    fig.update_yaxes(title_text = y_title, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", range = [0, 4], title_standoff = 25)
    return fig
def update_layout(fig):
    # Update the layout of the figure
    fig.update_layout(
        title_text = "<b>PERCEPTION MEAN NORMAL DISTRIBUTION</b>",
        title_font_family = "Roboto",
        title_font_size = 40,
        title_x = 0.5,
        font_family = "Roboto",
        font_size = 20,
        width = 1000,
        height = 500,
        template = "plotly_dark",
        margin = dict(l = 200, r = 250, t = 200, b = 150)
    )
    return fig
def create_grouped_scatter_plot(df, group_column, value_column, group_order):
    # Create a line plot of the fitted Normal curves for each group
    fig = go.Figure()
    fig = add_scatter_traces(fig, df, group_column, value_column, group_order)
    fig = add_vertical_line(fig)
    fig = update_axes(fig, "Perception Mean", "Probability Density")
    fig = update_layout(fig)
    fig.show()
# Display the figure
create_grouped_scatter_plot(merged_groups_dataFrame, "Highest Educational Level", "Perception Mean", educLevel_groups_order)
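To make clear what the fitting step in the plot does, the sketch below runs `norm.fit` on a handful of assumed Perception Mean values. The fitted parameters are simply the sample mean and the population standard deviation, so the curves are a smooth approximation of data that in reality only takes a limited set of discrete values.

```python
import numpy as np
from scipy import stats

# Toy Perception Mean values for one education level (assumed values)
group_data = np.array([0.35, 0.47, 0.59, 0.59, 0.65, 0.71, 0.82])

# norm.fit returns maximum-likelihood estimates of the mean and standard deviation
mu, std = stats.norm.fit(group_data)

# These match the sample mean and the population standard deviation (ddof = 0)
print(np.isclose(mu, group_data.mean()), np.isclose(std, group_data.std(ddof=0)))

# Evaluate the fitted density on a grid, as the plotting helper does
x = np.linspace(group_data.min(), group_data.max(), 100)
density = stats.norm.pdf(x, mu, std)
```

Because the density is a probability density and not a probability, its values can exceed 1, which is why the y-axis in the figure runs up to 4.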